Preference Dataset Creation for DPO Fine-Tuning

Leveraging LLMs to create a preference dataset for DPO fine-tuning

preference-dataset
Author

Quang Duong

Published

August 27, 2024

Introduction

Direct Preference Optimization (DPO) is a fine-tuning technique that aligns a language model’s outputs with human preferences by training directly on pairs of preferred and dispreferred responses. It therefore requires a preference dataset: for each prompt, one response that humans prefer and one that they do not. In this article, we’ll walk through a code implementation that creates such a dataset using Python, OpenAI’s API, and Hugging Face’s Datasets library.

Components of a Preference Dataset for DPO

A preference dataset typically includes:

Prompts: Inputs or questions given to the AI model.

Chosen Responses: AI-generated responses preferred by human evaluators.

Rejected Responses: Less preferred responses or responses not selected by human evaluators.

By providing this structure, the dataset allows a model to learn which responses are preferable, making it better aligned with human preferences.
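
As a minimal sketch of this structure, a single record can be written as a plain Python dictionary. The field names prompt, chosen, and rejected match the columns built later in this article; the story text and the is_valid_record helper are invented here purely for illustration.

```python
# One preference record. The text is made up for illustration only.
record = {
    "prompt": "Write a short story about a cat who finds a red ball.",
    "chosen": "Tom the cat found a red ball. He rolled it to his friend. They played all day.",
    "rejected": "The feline serendipitously discovered a crimson sphere and contemplated its provenance.",
}


def is_valid_record(rec: dict) -> bool:
    """Hypothetical helper: all three fields are non-empty strings and chosen != rejected."""
    required = ("prompt", "chosen", "rejected")
    has_fields = all(isinstance(rec.get(key), str) and rec[key].strip() for key in required)
    return bool(has_fields) and rec["chosen"] != rec["rejected"]


print(is_valid_record(record))  # True
```

A check like this is optional, but it can catch empty or identical pairs before they reach training.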

Our use-case

In our previous post, we created an instruction dataset, TinyStories_Instruction, from the raw TinyStories dataset. This dataset was specifically designed for fine-tuning a pretrained Large/Small Language Model using LoRA/QLoRA to develop a story generator tailored to 5-year-olds.

In this guide, we take the next step by creating a preference dataset from the previously generated instruction dataset. This dataset is used for fine-tuning a pretrained Large/Small Language Model through Direct Preference Optimization (DPO), enhancing our story generator to align even better with human preferences and produce engaging, age-appropriate content for young children.

The process for creating a preference dataset is illustrated below:

Implementation

This implementation involves a series of steps: extracting data, generating AI responses, and creating preference triplets.

import concurrent.futures
import json
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple
from datasets import Dataset, load_dataset, concatenate_datasets
from openai import OpenAI
from tqdm.auto import tqdm
from google.colab import userdata

1. Data Extraction Function

The extract_ground_instruction_story function extracts pairs of instructions and desired outputs from a given dataset.

def extract_ground_instruction_story(dataset):
    return [(example['instruction'], example['output']) for example in dataset]
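
Because the function only iterates over rows and reads two keys, it can be tried on a plain list of dictionaries standing in for the real dataset. The sketch below repeats the function so that it runs on its own; the sample rows are invented.

```python
def extract_ground_instruction_story(dataset):
    # Same logic as above: one (instruction, output) pair per row.
    return [(example["instruction"], example["output"]) for example in dataset]


# Stand-in rows (invented). Iterating a Hugging Face Dataset also yields one dict per row.
sample_rows = [
    {"instruction": "Write a story about a brave bunny.", "output": "Once there was a brave bunny."},
    {"instruction": "Write a story about a lost kite.", "output": "A kite flew far away."},
]

pairs = extract_ground_instruction_story(sample_rows)
for instruction, story in pairs:
    print(instruction, "->", story)
```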

2. Creating a PreferenceSet Class

The PreferenceSet class manages and stores the triples of (instruction, generated story, desired story).

class PreferenceSet:
    def __init__(self, triples: List[Tuple[str, str, str]]):
        self.triples = triples

    @classmethod
    def from_json(cls, json_str: str, instruction, desired_story) -> 'PreferenceSet':
        data = json.loads(json_str)
        triples = [(instruction, data['generated_story'], desired_story)]
        return cls(triples)

    def __iter__(self):
        return iter(self.triples)
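
A short usage sketch, assuming the model reply is a JSON object with a generated_story key as requested in the prompt of the next step. The class is repeated so that the snippet runs on its own, and the reply string is invented.

```python
import json
from typing import List, Tuple


class PreferenceSet:
    def __init__(self, triples: List[Tuple[str, str, str]]):
        self.triples = triples

    @classmethod
    def from_json(cls, json_str: str, instruction, desired_story) -> "PreferenceSet":
        data = json.loads(json_str)
        return cls([(instruction, data["generated_story"], desired_story)])

    def __iter__(self):
        return iter(self.triples)


# Invented reply in the expected shape.
raw_reply = '{"generated_story": "The feline traversed the labyrinthine garden."}'
preference_set = PreferenceSet.from_json(
    raw_reply,
    instruction="Write a story about a cat.",
    desired_story="A cat ran in the garden.",
)

for instruction, generated, desired in preference_set:
    print(instruction, "|", generated, "|", desired)
```

Note that from_json raises KeyError if the generated_story key is missing and json.JSONDecodeError if the reply is not valid JSON.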

3. Generating Preference-Response Triplets

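This function generates a story using OpenAI’s API and returns a preference triple in the format (instruction, generated response, desired response). Because the prompt deliberately asks for complex words and structures that are unsuitable for 5-year-olds, the generated story later serves as the rejected response, while the original story from the instruction dataset serves as the chosen one.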

def generate_preference_answer_triples(instruction: str, desired_story: str, client: OpenAI) -> List[Tuple[str, str, str]]:
    prompt = f"""Based on the following instruction, generate a story. \
        Story should be no longer than 50 words. Story uses several complex words or structures \
        that are not suitable for 5-year-olds.

        Provide your response in JSON format with the following structure:
        {{"generated_story": "..."}}

        Instruction:
        {instruction}
        """
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant who generates a story based on "
                    "the given instruction. Provide your response in JSON format."
                ),
            },
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},
        max_tokens=512,
        temperature=0.2,
    )
    result = PreferenceSet.from_json(completion.choices[0].message.content, instruction, desired_story)

    # Convert to list of tuples
    return result.triples
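
The function above assumes every reply is well-formed JSON containing a generated_story key. A reply can also come back empty or cut off, for example when the token limit is reached, so a more defensive variant may be useful. The sketch below uses a hypothetical helper, parse_generated_story, which is not part of the original pipeline; it returns None instead of raising so that the caller can skip the example.

```python
import json
from typing import Optional


def parse_generated_story(raw: Optional[str]) -> Optional[str]:
    """Hypothetical helper: return the story text, or None if the reply is unusable."""
    if not raw:
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Malformed or truncated JSON.
        return None
    if not isinstance(data, dict):
        return None
    story = data.get("generated_story")
    if isinstance(story, str) and story.strip():
        return story
    return None


print(parse_generated_story('{"generated_story": "A perplexing odyssey began."}'))
print(parse_generated_story('{"generated_story": "cut off mid-sent'))
```

Skipping unusable replies loses a few examples, but it keeps one bad response from aborting the whole run.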

4. Creating the Preference Dataset

This function creates a dataset using the extracted stories and generated responses.

def create_preference_dataset(dataset: Dataset, client: OpenAI, num_workers: int = 4) -> Dataset:
    stories = extract_ground_instruction_story(dataset)
    instruction_answer_triples = []

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Submit one generation task per (instruction, desired story) pair
        futures = [executor.submit(generate_preference_answer_triples, instruction, desired_story, client) for instruction, desired_story in stories]

        # Collect results as they finish; each result is a list of triples
        for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
            instruction_answer_triples.extend(future.result())

    instructions, rejected_story, chosen_story = zip(*instruction_answer_triples)
    return Dataset.from_dict({
        "prompt": list(instructions),
        "rejected": list(rejected_story),
        "chosen": list(chosen_story)
        })
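
The concurrency pattern can be tried without any API calls. The sketch below swaps the OpenAI request for a hypothetical stand-in, fake_generate, and builds the same three columns as a plain dictionary; the sample stories are invented.

```python
import concurrent.futures
from typing import List, Tuple


def fake_generate(instruction: str, desired_story: str) -> List[Tuple[str, str, str]]:
    # Stand-in for the API call: returns one (instruction, generated, desired) triple.
    generated = f"A needlessly elaborate retelling of: {desired_story}"
    return [(instruction, generated, desired_story)]


stories = [
    ("Tell a story about a dog.", "A dog ran in the park."),
    ("Tell a story about a cat.", "A cat sat on a mat."),
]

triples: List[Tuple[str, str, str]] = []
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(fake_generate, instruction, story) for instruction, story in stories]
    # as_completed yields futures in completion order, not submission order.
    for future in concurrent.futures.as_completed(futures):
        triples.extend(future.result())

# Transpose the list of triples into three parallel columns.
prompts, rejected, chosen = zip(*triples)
columns = {"prompt": list(prompts), "rejected": list(rejected), "chosen": list(chosen)}
print(columns)
```

Row order may differ from the input order, but each row stays consistent because every triple carries its own instruction. If no triples are collected, the zip(*triples) unpacking raises ValueError, so an empty input needs separate handling.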

5. The main function

This function initializes the OpenAI client, loads the dataset, creates a preference dataset, and uploads it to the Hugging Face Hub.

def main() -> None:
    client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))

    # 1. Load the raw data
    # Load the train and test splits
    train_dataset = load_dataset("tanquangduong/TinyStories_Instruction", split="train")
    test_dataset = load_dataset("tanquangduong/TinyStories_Instruction", split="test")

    # Combine the datasets
    raw_dataset = concatenate_datasets([train_dataset, test_dataset])

    print("Raw dataset:")
    print(raw_dataset.to_pandas())

    # 2. Create preference dataset
    preference_dataset = create_preference_dataset(raw_dataset, client)
    print("Preference dataset:")
    print(preference_dataset.to_pandas())

    # 3. Train/test split and export
    filtered_dataset = preference_dataset.train_test_split(test_size=0.1)
    filtered_dataset.push_to_hub("tanquangduong/TinyStories_Preference")

6. Hugging Face Hub Login

To authenticate with Hugging Face and run the pipeline:

from huggingface_hub import login
# Log in to the Hugging Face Hub
login(token=userdata.get('HF_TOKEN'))

# Launch the pipeline to create the preference dataset
main()

Conclusion

This article walked through creating a preference dataset for Direct Preference Optimization (DPO), with the goal of improving a story generator for 5-year-olds by building on a previously created instruction dataset. Each record consists of a prompt, a chosen response (the original story from the instruction dataset), and a rejected response (a story generated with deliberately complex language). The key steps are extracting instruction-output pairs, generating the rejected responses with OpenAI’s API, and organizing the results into preference triplets. The final dataset is split into train and test sets and uploaded to the Hugging Face Hub, ready for DPO fine-tuning toward engaging, age-appropriate content.